A good example of more realistic cloud-native (spot) pricing is given in RETRO Is Blazingly Fast, where the cost (~$1k) was dominated by re-embedding the entire dataset (note that they did not re-train), as quoted below:
Tokenization takes around 1.9 min / 1M chunks on your standard CPU core. The Pile ends up being around 5.8B chunks (370B tokens), so that means you’re looking at ~180 hours of CPU time to tokenize, which you can easily parallelize down to only a few hours of wall time.
With a CPU core on the cloud going for around $0.03 / hour, that means you’ll spend less than $10 on tokenization.
BERT embedding is the most expensive step. On an RTX A5000, BERT embedding takes around 10 minutes per 1M chunks. That’s around 1k GPU hours to embed The Pile, which again is trivial to parallelize. This cost around $1k on CoreWeave.
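The quoted preprocessing figures can be sanity-checked with a few lines of arithmetic. The per-unit rates are the post's own; the ~$1/hr A5000 rate is an assumption inferred from "around 1k GPU hours ... cost around $1k", not a quoted price.

```python
# Back-of-the-envelope check of the quoted preprocessing costs.
CHUNKS = 5.8e9            # chunks in The Pile (370B tokens)
TOK_MIN_PER_1M = 1.9      # CPU minutes to tokenize 1M chunks
EMB_MIN_PER_1M = 10.0     # A5000 minutes to embed 1M chunks
CPU_USD_PER_HR = 0.03
GPU_USD_PER_HR = 1.00     # assumed A5000 rate, implied by ~$1k / ~1k GPU h

cpu_hours = CHUNKS / 1e6 * TOK_MIN_PER_1M / 60   # ~184 h of CPU time
gpu_hours = CHUNKS / 1e6 * EMB_MIN_PER_1M / 60   # ~967 GPU hours

print(f"tokenization: {cpu_hours:.0f} CPU h, ${cpu_hours * CPU_USD_PER_HR:.2f}")
print(f"embedding:    {gpu_hours:.0f} GPU h, ${gpu_hours * GPU_USD_PER_HR:.0f}")
```

Both steps are embarrassingly parallel, so wall-clock time is just a matter of how many workers you spin up.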
Note that BERT embeddings are around 3 KB each on disk (768 float32s). 5.8B of them take up about 16 TB, so watch out for that. (Disk space is cheap.)
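The storage claim checks out directly from the embedding dimension:

```python
# 768-dimensional float32 embeddings, 4 bytes per float.
DIM, BYTES_PER_FLOAT32 = 768, 4
per_embedding = DIM * BYTES_PER_FLOAT32      # 3072 B, i.e. ~3 KB
total_bytes = 5.8e9 * per_embedding

print(per_embedding)             # 3072
print(total_bytes / 2**40)       # ~16.2 TiB for all of The Pile
```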
One particular trick used by FAISS (the inverted file structure) requires taking a small percentage of the data (64M embeddings) and using them to train the index. On a V100 GPU, this only took around 4 hours, so the cost was negligible.
Once the index is trained, we can add all the embeddings to the index, compressing them for lookup. This takes longer than you’d expect (around 192 CPU hours) but ultimately only represents a cost of <$30.
The FAISS index is not totally cost free. The index itself ends up being big, requiring around 176 GB of RAM to query, which costs about $0.88 per hour on your average cloud provider.
However, this allows you to drastically reduce your GPU usage. Say, for example, you need 5 GPUs running in parallel to do inference on a 175B parameter model, which costs around $6 an hour. By adding an extra $0.88 / hour in CPU RAM, you can reduce the number of GPUs you have to run to just 1, saving around $5 / hour in GPU costs.
This also applies to models that are already using a single GPU. By shrinking your model with RETRO’s database, requests get served faster, meaning more GPU bang for your buck. Instead of serving 60 req / hour on a single GPU, you’re serving 600+, just for a little extra CPU RAM.
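The serving economics above can be made concrete. The per-GPU rate here ($6/hr across 5 GPUs, i.e. $1.20/GPU/hr) is inferred from the quoted total, so the exact saving depends on that assumption:

```python
# Rough serving economics from the quoted figures.
GPU_USD_PER_HR = 6.00 / 5   # inferred: $6/hr across 5 GPUs
RAM_USD_PER_HR = 0.88       # 176 GB of RAM to hold the FAISS index

big_model = 5 * GPU_USD_PER_HR                # $6.00/hr for the 5-GPU model
retro = 1 * GPU_USD_PER_HR + RAM_USD_PER_HR   # 1 GPU + the index in RAM
print(f"hourly saving: ${big_model - retro:.2f}")

# Throughput angle: 60 vs 600+ req/hr on a single GPU.
baseline_per_req = GPU_USD_PER_HR / 60
retro_per_req = (GPU_USD_PER_HR + RAM_USD_PER_HR) / 600
print(f"{baseline_per_req:.4f} vs {retro_per_req:.4f} $/request")
```

Under these assumptions the per-request cost drops by roughly 6x even after paying for the extra RAM.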
FAISS has the ability to memory map an index, which allows you to read it directly from disk instead of allocating RAM for it. This is slightly slower, of course, but probably worth the trade.